fix(cpu-ops): lazy transpose for Q8_0 packed tensors #736
Merged
ops.transpose rewraps the packed bytes with a flipped shape for the K-series (Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but Q8_0 fell through to the generic FP32 DenseTensorDataFactory path, which casts the Byte-backed buffer to Float and throws ClassCastException. Add the analogous Q8_0BlockTensorData case.

This unblocks keeping a Q8_0 matmul weight packed through linearProject (matmul(x, transpose(W))), notably FunctionGemma's tied Q8_0 lm_head, which otherwise has to be dequantized to FP32 (~0.67 GB) and OOMs the 1.9 GB SL2610 board.

Verified: SKaiNET-transformers GemmaQ5KPackedParityTest (eager load(NATIVE_OPTIMIZED)) now packs the lm_head as Q8_0 and decodes byte-identically to the FP32 baseline. See SKaiNET-transformers#178.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
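To put the ~0.67 GB figure in context, here is a rough size comparison in Kotlin. It assumes the GGML-style Q8_0 block layout (32 int8 weights plus one 2-byte fp16 scale, 34 bytes per block); the commit message does not confirm that SKaiNET uses exactly this layout. The element count is hypothetical, chosen only so that the FP32 size comes out near 0.67 GB, since the actual lm_head dimensions are not stated.

```kotlin
// Assumption: GGML-style Q8_0 block = 32 int8 weights + one 2-byte fp16 scale.
const val Q8_0_BLOCK_WEIGHTS = 32L
const val Q8_0_BLOCK_BYTES = 34L

// FP32 stores every element in 4 bytes.
fun fp32Bytes(elements: Long): Long = elements * 4L

// Q8_0 stores elements in fixed-size blocks, so the count must be block-aligned.
fun q8_0Bytes(elements: Long): Long {
    require(elements % Q8_0_BLOCK_WEIGHTS == 0L) { "element count must be a multiple of 32" }
    return elements / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES
}

// Hypothetical element count (not taken from the PR): 167,772,160 weights.
const val HYPOTHETICAL_ELEMENTS = 167_772_160L
```

Under these assumptions, `fp32Bytes(HYPOTHETICAL_ELEMENTS)` is 671,088,640 bytes (about 0.67 GB) and `q8_0Bytes(HYPOTHETICAL_ELEMENTS)` is 178,257,920 bytes (about 0.18 GB), so keeping the weight packed would need roughly a quarter of the FP32 footprint.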
This was referenced Jun 15, 2026
Problem
`DefaultCpuOps.transpose` rewraps packed bytes with a flipped shape for the K-series (Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but Q8_0 falls through to the generic FP32 `DenseTensorDataFactory` path, which casts the Byte-backed buffer to Float and throws a ClassCastException. This blocks keeping a Q8_0 matmul weight packed through `linearProject` (`matmul(x, transpose(W))`).
Fix
Add the analogous `is Q8_0TensorData -> Q8_0BlockTensorData(Shape(cols, rows), d.packedData)` case (one line + import). Bytes are layout-agnostic to the kernel's `[out, in]` block-major convention, so this is a metadata-only (lazy) transpose like the others.
Why it matters
Lets FunctionGemma's tied Q8_0 lm_head stay packed in the eager `NATIVE_OPTIMIZED` path instead of being dequantized to FP32 (~0.67 GB), which OOMs the 1.9 GB Astra Machina SL2610.
Verification
SKaiNET-transformers `GemmaQ5KPackedParityTest` (composite `-PuseLocalSkainet=true`) now packs the lm_head as Q8_0 and decodes byte-identically to the FP32 baseline. See SKaiNET-transformers #178.
🤖 Generated with Claude Code
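
The metadata-only transpose described under Fix can be illustrated with a minimal, self-contained Kotlin sketch. All types here (`Shape2D`, `TensorData2D`, `DenseF32`, `PackedQ8`) are hypothetical stand-ins invented for illustration; they are not SKaiNET's actual classes, and the real `DefaultCpuOps.transpose` may be structured differently. The sketch only contrasts the two strategies: a packed tensor gets a flipped shape over the same byte buffer, while a dense FP32 tensor has its elements copied.

```kotlin
// Hypothetical stand-ins for illustration; these are NOT SKaiNET's real classes.
data class Shape2D(val rows: Int, val cols: Int)

sealed interface TensorData2D { val shape: Shape2D }

// Dense FP32 storage: transposing has to move elements.
class DenseF32(override val shape: Shape2D, val values: FloatArray) : TensorData2D

// Packed Q8_0-style storage: an opaque byte buffer plus a logical shape.
class PackedQ8(override val shape: Shape2D, val packed: ByteArray) : TensorData2D

fun transpose(d: TensorData2D): TensorData2D = when (d) {
    // Metadata-only (lazy) transpose: flip the shape, share the same bytes.
    is PackedQ8 -> PackedQ8(Shape2D(d.shape.cols, d.shape.rows), d.packed)
    // Eager transpose for dense floats: copy each element to its new position.
    is DenseF32 -> {
        val (rows, cols) = d.shape
        val out = FloatArray(d.values.size)
        for (r in 0 until rows) for (c in 0 until cols) {
            out[c * rows + r] = d.values[r * cols + c]
        }
        DenseF32(Shape2D(cols, rows), out)
    }
}
```

Because the `when` is exhaustive over a sealed interface, a packed type cannot silently fall into the dense branch, which is the kind of fall-through the PR fixes for Q8_0. Transposing a `PackedQ8` returns an object whose `packed` field is the same array instance as the input, so no bytes are copied or dequantized.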